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Abstract 

We present a numerical approach to calculate non-equilibrium eigenstates of a periodically time-modulated quantum system. The 
approach is based on the use of a chain of single-step time-independent propagating operators. Each operator is time-specific 
and constructed by combining the Magnus expansion of the time-dependent system Hamiltonian with the Chebyshev expansion 
of an operator exponent. A construction of a unitary matrix of the Floquet operator, which evolves a system state over the full 
modulation period, is performed by propagating the identity matrix over the period. The independence of the evolutions of basis 
vectors makes the propagation stage suitable for implementation on a parallel cluster. Once the propagation stage is completed, a 
routine diagonalization of the Floquet matrix is performed. Finally, an additional propagation round, now with the eigenvectors as 
the initial states, allows to resolve the time-dependence of the Floquet states and calculate their characteristics. We demonstrate 
the accuracy and scalability of the algorithm by applying it to calculate the Floquet states of two quantum models, namely (i) a 
synthesized random-matrix Hamiltonian and (ii) a many-body Bose-Hubbard dimer, both of the size up to 10 4 states. 
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1. Introduction 

Fast progress in manipulations with cold and ultra-cold 
atoms, quantum optics and nanoscale fabrication techniques has 
brought quantum physics in touch with technology EES). 
It is natural then that computational quantum physics plays an 
ever increasing role in explaining and guiding current experi¬ 
ments and suggesting new (4). From the computational point 
of view, the complete resolution of a coherent, i.e., an isolated 
from the environment, quantum system means the solution of 
the eigenvalue problem for the system Hamiltonian H. When 
the Hamiltonian is time-independent, this task can be executed 
by performing full diagonalization of the Hamiltonian matrix 
When the system becomes to large then the size of the matrix 
does not allow for the full diagonalization. The task, however, 
could be restricted to finding lowest energy eigenstate(s) which 
can be accomplished by using the Lanczos algorithm Q or 
more sophisticated tools, such as the Density-Matrix Renormal¬ 
ization Group (DMRG) methods 0- In cases that the system 
is periodically modulated in time, its Hamiltonian becomes a 
time-periodic matrix H(t + T) - H(t + 2n/a>) = H(t). Then, 
the dynamics of the system is governed by the set of so termed 
Floquet eigenstates mil]. These states are not eigenvectors of 
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the Hamiltonian H(t) but instead of the unitary Floquet operator 

U T = T exp[-A H(T)dr], (1) 

where T is Dyson’s time-ordering operator. This operator prop¬ 
agates the system over the one full period T of modulation, 
while the corresponding time-periodic Floquet states (see be¬ 
low) at equal times t = f muni form a time-periodic orthogo¬ 
nal basis spanning the system Hilbert space and evolving under 
the action of the time-dependent Hamiltonian. The structure 
of the unitary Floquet matrix, and thus properties of the Flo¬ 
quet states, depend on the modulation protocols and parameters. 
This is a key feature of periodically driven quantum systems 
which makes them so attractive to the theoreticians and experi¬ 
mentalists working in the filed of quantum optics, optomechan- 
ics and solid state physics 10 [TO] 33] CE2] All • Strong modu¬ 
lations can sculpt a set of non-equilibrium eigenstates which 
may drastically differ from the states exhibited by the system in 
the unmodulated, time-independent limit. Modulations allow 
to grasp novel phenomena and effects which are out of reach 
within time-idependent Hamiltonians; they can be used to cre¬ 
ate topological insulators in semiconductor wells DU, synthe¬ 
size Majorana fermions in quantum wires ED, and engineer 
gauge fields for spinless neutral atoms Da. 

The calculation of Floquet states of a large quantum system 
constitutes a challenge. The key step is a construction of the 
corresponding unitary Floquet matrix, Eq. 0 (its final diago¬ 
nalization is such a routine as, for example, the diagonalization 
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of stationary Hamiltonian matrices). The most straightforward 
way to obtain Ur is to numerically propagate the identity ma¬ 
trix over the time period T. However, the propagation with a 
time-dependent Hamiltonian operator is an issue itself, to be 
addressed in the next section. There are two ways to do so. 

The first option is to use piecewise-constant modulation 
functions. This allows to reduce the computational task to 
the diagonalization of time-independent Hamiltonians, one for 
every time interval, and the expansion of eigenvectors of a 
preceding Hamiltonian in the basis of the consecutive one. 
Such modulations were used to investigate connections be¬ 
tween integrability and thermalization mmm, and to ex¬ 
plore disorder-induced localization f20l in periodically driven 
many-body systems. With respect to the thermalization it was 
found that the modulations heat the system to the infinite tem¬ 
perature so that the system Floquet states are near uniformly 
smeared over the eigenbasis of the system in the absence of 
driving mmm. An important question that immediately 
arises is whether this is a universal phenomenon or it is related 
to the non-differentiability of the modulation function (which 
property induces the presence of all multiple frequencies ka>, 
k - 1,2,..., in the spectrum of the modulations function). Evi¬ 
dently, this question cannot be answered without going beyond 
the piecewise setup. In addition, in the view of possible exper¬ 
imental realizations, smooth continuous modulations are also 
more preferable. 

An alternative option is to expand the time-dependent Hamil¬ 
tonian into a Fourier series and, and then truncating it, by keep¬ 
ing 2F + \ harmonics kto, k = —F ,.... 0,..., F only, to reduce 
the problem to the diagonalization of a time-independent super- 
Hamiltonian EEXl- This is a reliable method to obtain Floquet 
spectrum of a system of a size up to a hundred of states. For 
larger systems, this strategy leads to a computational problem: 
The size of the super-Hamiltonian scales as N x (2 F + 1), where 
N is the dimension of the system’s Hilbert space. Computa¬ 
tional diagonalization efforts increase as [N x (2 F + l)] 1 * 3 , while 
the known diagonalization algorithms are poorly scalable. For 
a system of the size N = 10 4 , already F = 50 harmonics is too 
much; a full diagonalization of a 10 6 x 10 6 matrix is unfeasible. 
At the same time, this large number of harmonics is seemingly 
not enough to resolve faithfully the Floquet spectrum of the sys¬ 
tem 0 

Therefore, in order to calculate the Floquet state of a system 
with N ^ 10 3 states, the propagation stage has to be included 
into an algorithm. A propagation method should guarantee a 
high accuracy with respect not only to the unitarity of the time 
evolution but also with respect to the phases of complex vec¬ 
tors. That is because Floquet states appear as superpositions of 
basis vectors used to write system’s Hamiltonian. Accumulated 
phase errors will destroy the interference and lead to an incor¬ 
rect set of Floquet states. As we show in Section [7] quantum 
interference effects, together with some facts from the quantum 
chaos theory, can be used to benchmark the accuracy of an al¬ 


1 The eigenvalue spectrum of the super-Hamiltonian can be resolved with the 

accuracy 2n/M at best. This is not enough taking into account that the actual 

mean spacing between the eigenvalues is n/N. 


gorithm. 

Because of the trade-off between the accuracy and system 
size, the time of sequential vector propagation grows super- 
linearly with N. Faithful calculations of Floquet spectra of non- 
integrable systems (that are systems whose Hilbert space cannot 
be decomposed into several non-interacting low-dimensional 
manifolds ||22| ). with tens of thousands of states, can only be 
performed with scalable algorithms. 

This paper presents an algorithm to calculate the Floquet 
spectra of strongly-modulated quantum systems with N ^ 10 4 
states and its implementation on a parallel supercomputer. The 
propagation part of the algorithm is based on the combination of 
the Magnus expansion of time-dependent linear operators 1231 
and the Chebyshev expansion of operator exponents Il24l . This 
combination has been proposed in |25l . where its particular nu¬ 
merical realization, implementing a commutator-free Magnus 
scheme, was tested. We illustrate the accuracy and scalability of 
the algorithm by using two quantum models, with a synthesized 
random-matrix Hamiltonian and a many-body non-integrable 
bosonic dimer. The size of model system is limited by the di¬ 
agonalization routine only, so the algorithm can be used to cal¬ 
culate Floquet states of systems of the size up to N ~ 50 000 
states. 

The rest of the paper is organized as follows: Section [2] out¬ 
lines the theoretical background and introduces the Magnus and 
Chevyshev expansions; Section [3] describes the algorithm; in 
Section [4] we introduce model systems, apply the cluster im¬ 
plementation to calculate their Floquet states in Section [5] and 
analyze the results in Section [7] Finally we summarized our 
findings and outline further perspectives in Section]!] 

2. Theoretical background 

Floquet states. We consider quantum systems whose dynam¬ 
ics is determined by the time-dependent Schrodinger equation 

m\m) = m\m\ ( 2 ) 

where the Hamiltonian H(t) denotes a time-periodic Hermitian 
operator, H(t + T) - H(t). We assume that the system evolves 
in a finite-dimensional Hilbert space spanned by N basis vec¬ 
tors. The time evolution of the system is fully determined by a 
unitary operator U(to, t ), being the solution of the equation 

iM,U(to,t) = H(t)U(fo,t), (3) 

for the initial condition in the form of the identity matrix, 
U(Jq, to) = 11. This provides the propagator of the system, i.e., 
a unitary operator which evolves any system state from a time 
to to time to + t, U(to, t)\i//(to)) = \i//(to + t ))• A time to e [0, T] 
specifies the state of the Hamiltonian operator at the initial time, 
when, for example, the driving was switched on. This start¬ 
ing time can be absorbed into the Hamiltonian as a parameter, 
H(t,to) = H(t + to) (the propagator U(to,t) can be onbtained 
from f/(0, t) as U(to, t) = (7 + (0, to)U(0, t + to)), so for later con¬ 
venience, we set to — 0 in Eq. 0 and denote U(0, t) by U t . 
Eigenvectors of the normal matrix U T , 

U T \^) = p = l,... ,N, (4) 


2 



form an orthonormal basis in the system Hilbert space. These 
vectors could also be taken as snapshots of time-dependent vec¬ 
tors |at the time instant t — T, f/ f |<^(0)) = e~ lei,t ^ h \ip M (t)), 
with e p - HQJT. The exponents e p have the dimension of en¬ 
ergy and are termed quasienergies. Quasienergies can be de¬ 
termined up to multiples of ha> so they are conventionally re¬ 
stricted to the interval [-haj/2,hu>/2]. 

By denoting \cb p (t)) = e~ le r tlh \ip p {t)), we end up with the set 
of time-periodic Floquet states Grana 


of its action on the vector, \ifit 2 )) = exp(-zT2(f], t 2 )/h) |»A(?i))- 
This can be calculated by implementing the Chebyshev poly¬ 
nomial expansion of the operator exponent, which is based on 
a recursive iteration scheme ll24l . 

It/b+ito)) = -2/Q(7 i,/2) I fiih)) + \fi-\ih)) (10) 

with the initial conditions \foit 2 )) = Wih)) and l/ife)) = 
-iCl(ti,t 2 ) Ykoih))- Here Q(fi, ? 2 ) is a shifted and rescaled oper¬ 
ator. 


I Mt + r )> = 1^(0). (5) 

By knowing the Floquet spectrum of a system, {e p , \<p p (t))}, 
and the system initial state |/(0)>, one can calculate the state of 
the system at any instant of time t > 0, 

m)) = c n = {iiim^m- ( 6 ) 


CKtut2) = £1(tut2} ] ' ( * E + E ™\ (11) 

A E 

which has all its eigenvalues restricted to the interval [-1,1] 
M- The spectral half-span A E = (E max -E m ; n )/2 should be es¬ 
timated from the extreme eigenvalues E mln and E mm of £l(t \, t 2 ) 
operator beforehand. 

Finally, the new vector can be obtained as 


Magnus expansion. The idea of the Magnus expansion |[26l is 
to construct a time independent Hamiltonian operator D(fi, t 2 ), 
parameterized by the two times, t\ and t 2 , such that 


U(ti,t 2 ) = exp 


H 


, t 2 ) 


(7) 


The operator is given by an infinite series involving nested 
commutators El 


W I,t 2 ) = H(t i) dr\ + 


+ — 
2 


fj>f 


[H(t i ), H(t 2 )] dr 2 + 


( 8 ) 


i f ' 2 r Ti r Ti 

- dr, dr 2 I ([H(ti)[H(t 2 ), 
o Jfj J/j Jfj 

[H(t 3 )[H(t 2 ), dr 3 + .... 


H(t 3 )]+ 


L 

m 2 )) = e-“ lh/h Y J aim 2 )), (12) 

1=0 

where (5 = A E + E wm and li = t 2 - t\. The expansion coeffi¬ 
cients a/ = 27/(7?) and a (l = 7o(/?), where J/(R) are the Bessel 
functions of the first kind and R = hAE/h. The parameter L 
sets the order of the Chebyshev approximation by truncating 
the series ( [T2| . Strictly speaking, this scheme does not preserve 
the unitary time evolution. However, its convergence with the 
increase of L is fast so that L can be chosen such that the de¬ 
viation from unitarity is dominated by the round-off error l24l . 
We have found that it is enough to take L < 100 for N < 10 4 
and the further increase of L does not improve the accuracy of 
calculations. 

3. The algorithm 


An implementation of the expansion ([8| assumes a truncation 
of the infinite series, summation of the finite series into an op¬ 
erator F2(t|, t 2 ), and use of the latter as the propagator Uit\,t 2 ) 
j26l . The Floquet operator Uj can be approximated as a chain 

Ut = U(0,t\)U(t\,t 2 )... U(t M - 1, t M ) ~ 

(9) 

x e -iCl(0.ti)lh e ~iQ(t,.t 2 )/h e -iCl(I M -ul M )/h v ' 

where t k = kh = kT /M, k = 0, M. Since all terms on the rhs 
of Eq. ([8]) are Hermitian, the truncated operator £i(t t , t 2 ) is Her- 
mitian, and an approximation of any order preserves the unitary 
time evolution. The truncated operator in the form ([8]) is not 
very suitable for computations. It is more convenient to approx¬ 
imate Q.(t\,t 2 ) with lower-order commutator series, calculated 
by using values of H(ti/ 2 ) at the midpoints 1 1 / 2 = (fi + ?2)/2 
(this is our choice, see Section [7] for more details), or with a 
commutator-free linear combination of H(tj), calculated at dif¬ 
ferent times tj e [t\,t 2 \ f23ll25 i. 

Chebyshev expansion. The exponentiation of an operator is 
a computationally expensive operation Il27l . In order to propa¬ 
gate vector |i/z(fi)) to time t 2 , the knowledge of the unitary op¬ 
erator exp(-iQ.(ti,t 2 )/h ) is redundant: we need only the result 


We restrict the consideration to Hamiltonians of the form 


Hit) = H 0 + fit ) 

‘ ^mocb f(t+T) = m 


(13) 


where /(f) is a scalar function and Hq, H mo a are time- 
independent Hermitian operators. Most of the currently used 
models, including the ones discussed in Section [4] belongs to 
this class. Equation (13 i is the simplest nontrivial case of a 


general situation. Hit) = Ho + f s it) ■ H^ od , with s / N z . Our 
results can be generalized to the case s > 1 in a straightforward 
manner. 

Next we specify the method to approximate Cl(ti,t 2 )- As 
we discussed in the previous section, there exist a variety of 
schemes EH- Our choice is conditioned by the form of the 


Hamiltonian, Eq. (13 1 , and the intention to realize the algorithm 
on a parallel cluster. More specific on the last point, we are not 
concerned about the number of commutators needed to be cal¬ 
culated (and then stored) as long as they are all time-f/zdepended 
and do not have to be recalculated in course of the propagation. 
Here we use the midpoint approximation of the Magnus expan¬ 
sion with three commutators 


^ = ai + j^a 3 


1 

240 


[-20tri - cr3 + Ci,a 2 + C 2 ], (14) 
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so that £2 = Q + Oih 1 ). Specifically, 


a: 


C 2 


h> dj- l H(t l/2 ) 
( 7 - 1 )! dt'-' 

1 

- — [a u 2a 2 + Ci]. 


Ci = [ai,a 2 ]. 


(15) 


The original formulation demands the calculation of ay on 
every time step. This task, for the specific choice given by 
Eq. (T^, reduces to calculations of midpoint values of the 
scalar functions /(f), /'(f), and /"(f). These values have to 
be weighted with time-independent commutators of the forms 
[//(,, // lno d], [//(), [//(), // m( id]], etc. There are nine commutators 
for the chosen scheme, Eq. but they have to be calculated 
only once, when initiating the algorithm. 

The choice of the operational basis to write operators Ho and 
// m od constitutes an important point. We use of the eigenbasis of 
the operator Hq without going into the interaction picture (see 
a relevant discussion in Ref. |[25l ). In this basis equation ( [T3| 
assumes the form 


ihd, |/(0) = [diag/s, + /(f) • H mod ] |/(f)), 


(16) 


where diag E, is a diagonal matrix consisting of the eigenvalues 
{£, } of //(>, and // mo<1 is the matrix representation of the operator 
in the eigenbasis of Hq. In numerical experiments with differ¬ 
ent periodically-driven nonlinear potentials, we found that this 
choice of the basis guarantees stable performance for N > 10 3 . 
Because of the diagonal form of the matrix Hq, it also simplifies 
calculation of the nested commutators. 

The algorithm can be described as the propagation of the 
NxN identity (in the eigenbasis of Ho) matrix over the time in¬ 
terval T. The propagation is realized with a chain of M single- 
step Magnus propagators over the time interval h = T/M. by 
performing L Chebyshev iterations (cf. Eqs. (jTOjr, (12 1 ) with the 
rescaled operator Q(fj-i, 4), k - 1,..., M. Eq. (14 Note, that in 
order to apply the rescaling procedure (11 1 , Q i-* Cl, one has 
to estimate the extreme eigenvalues and E mm beforehand. 
We diagonalize the matrix £1 at five equidistant time instants 
tj e [0,7'], and use the maximal and minimal values from the 
collected eigenvalue set as E mm and E m ax . Once the propagation 
stage is completed, the result.i.e. the NxN unitary matrix Uj is 
diagonalized and its eigenvalues {e^} and eigenvectors (1/^(0))), 
are written into the output file. An additional propagation round 
can be performed, now with eigenvectors (1/^(0)} as initial vec¬ 
tors, in order to calculate relevant characteristics of the Floquet 
states. For example, it can be the expectation value of a relevant 
operator A(t), averaged over the one period 


-fl 


(A^)t = ~ <//0|A(0|/ A ,(0)£/f. 


(17) 


4. Models 

To test the algorithm, we employed two specific physical 
cases for the Hamiltonians entering the setup given by equation 

m 

The hist system is a synthesized model, with the Hamilto¬ 
nians Ho and H mod being members of a Gaussian orthogonal 



Figure 1: (color online) Parallel computation of a Floquet operator Uj • The 
initial NxN identity matrix 1L is sliced into P rectangular sub-matrices X/, 
i = l,... ,P, each consisting of N/P basis vectors. The sub-matrices are then 
independently propagated on P cluster nodes, by using the Magnus-Chebyshev 
propagation algorithm. The output vectors form the corresponding columns of 
the Floquet matrix. 


ensemble GOE(A) 833) of a variance cr, that is a parameter of 
the system. Random matrix theory and the corresponding mod¬ 
els remain at the center of research in many areas of quantum 
physics l34l . but it is only very recently that these two hitherto 
disentangled research fields started to interact Ql HU. 

Our second test model consists of a driven /V-particle Bose- 
Hubbard dimer l35l . with the Hamiltonians 

t t U 2 

H 0 = -v{a\a 2 +aia' 2 ) + —(hi - h 2 ) , 

/ (Is) 

7/mod /.2 


where a\ (a,) and hi = a\a, are the bosonic creation (annihi- 
lation) and particle number operators for the j-th site, respec¬ 
tively. Parameters v and U are the hopping rate and one-site 
interaction strength. In the Fock basis the Hamiltonian // () ac¬ 
quires a tridiagonal structures, while H mod becomes a diagonal 
matrix. This model is extensively used in many-body quantum 
physics, both in theoretical and experimental domains; e.g., see 
Ref. |36). 

5. Implementation of the algorithm on a cluster 

We now describe a program realization of the algorithm and 
its consecutive realization on a high-performance cluster. Our 
C code employs Intel® Parallel Studio XE package |29l . The 
main data structures are complex double-precision matrices. 
Computational load is distributed among cluster nodes by the 
standard Message Passing Interface (MPI). On each node com¬ 
putationally intensive operations are implemented by calling 
BLAS functions from Intel® Math Kernel Library (Intel MKL), 
in shared-memory parallel mode m. 

The code consists of three main steps (they are summarized 
in the pseudocode shown in Algorithm |T]>. In the first step, the 
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Algorithm 1 

l: initialization, memory allocation 

2: upload system & method parameters: the number of steps M per period T, the number of Chebyshev iterations L 
3: calculate basis of the stationary Hamiltonian Ho, auxiliary matrices diagfs, and /7 m<K j 
4: calculate Bessel functions Ji(R), R = R(h), l = 0, L 

5: distribute the initial sub-matrices X,, i = 1, P and precalculated data between P nodes 

6: for k = to M do > MPI distributed computational loop 

7: calculate Q[4 _j, 4 ] for the current 4 = kh, h = T/M 

8: rescale Q t-> O 

9: perform L Chebyshev iterations with O and update A, 

10: end for 

11: combine X ,• into U r 
12: diagonalize Uj 

13: save eigenvectors and eigenvalues of Uj 
14: release memory 


program initializes MPI, allocates memory, reads the seed data 
and parameters from configuration files, and makes necessary 
pre-calculations before launching the main cycle: calculates 
the eigenbasis of the Hamiltonian Ho and auxiliary matrices 
diag£), // mo d (see Eq. 161, the Bessel functions J/(R), R = R(h), 
1 = o,...,z0 needed for the Chebyshev series (see Eq. |l2|), 
and nine commutators needed for the Magnus expansion (see 
Eq. [M}. These computations are performed on each cluster 
node. It is important to choose appropriate operational presen¬ 
tation of the N x N matrices, starting from the initial identity 
matrix 11. The most straightforward solution is to split 11 into N 
vectors, store the vectors as independent arrays, and then prop¬ 
agate them independently and in parallel. A more efficient so¬ 
lution is to form sub-matrices of initial vectors that allows for 
parallel propagation and then make use of the third-level BLAS 
operations, in particular, matrix-matrix product, instead of a se¬ 
ries of matrix-vector products. As a result, the memory hier¬ 
archy would be used in a more efficient way and a substantial 
decrease of the computation time would be achieved. 

The second step involves the propagation of the initial ma¬ 
trix 11 over one period T. Because the process is iterative, a 
parallelization in time is not possible. However, a data paral¬ 
lelization is feasible. The initial identity matrix can be split 
into P sub-matrices Xj, i = 1,..., P, each consisting of N/P ba¬ 
sis vectors, which are then distributed among P cluster nodes. 
Therefore, the first N/P rows are propagated on the first node, 
the next N/P rows on the second node, etc. (according to the 
C row-major order, initial vectors are written as rows). This 
idea is sketched in Fig. |T] The scheme possesses a potential mi¬ 
nor drawback that could be encountered in the case of a large 
number of processing units, when splitting could cause a strong 
imbalance in This might affect the performance of the mathe¬ 
matical kernels, which were not developed to handle “thin” ma¬ 
trices consisting of a few rows and thus limits number of pro¬ 
cessing units that could be used to accelerate the propagation. 
The major advantage of the scheme, however, is a next to uni¬ 
form distribution of the workload among the nodes. Together 


2 The Bessel functions were numerically computed using Fortran intrinsic 
function bessel-jn. 


with a constant number of operations on each step, this allows 
to estimate the scaling of the overall computing time with P. 

A single propagation step realizes the recipe given at the 
end of Section [3] By employing MKL functions, we calculate 
the matrix fl(4-i,4) following the Magnus expansion (14 1 . It 
is computed independently on each cluster node, as the small 
computing time does not justify parallelization on a distributed 
memory. 

The computationally intensive part of the algorithm is the 
approximation of the action of the matrix exponent by Cheby- 
shev’s iterations, Eqs. ( |10|12 1 , and the further updating of prop¬ 
agated sub-matrices on each cluster node. The mathemati¬ 
cal core of this step is the multiplication of complex double¬ 
precision dense matrices (it is realized with zgemm routine 
(32l ). This part of the algorithm is fully parallel. 

During the final, third step the program assembles sub¬ 
matrices into the Floquet matrix and diagonalizes the latter 
by using a multi-threaded Intel MKL implementation (we use 
zgeev routine E3). For the matrix size N ~ 10 4 , a multi-core 
implementation is sufficient. Finally, the results of the diago- 
nalization are written to the output files, the memory is deallo¬ 
cated, and MPI is finalized. 


6. Program performance and scalability analysis 

In this section we present the performance analysis of the 
code. Test runs were performed on the “Lobachevsky” super¬ 
computer at the Lobachevsky State University of Nizhni Nov¬ 
gorod (37]]. We employed up to 64 computational nodes, with 
the following configuration per node: 2x Intel Xeon E5 - 2660 
CPU (8 cores, 2.2 GHz), 64 GB RAM, OS Windows HPC 
Server 2008. We use Intel MKL, Intel C/C + + Compiler, and 
Intel MPI from Intel Parallel Studio XE (29l . All parallel ver¬ 
sions of computationally intensive routines from MKL utilized 
16 cores on each node. 

To test the performance of the program we use the synthe¬ 
sized random-matrix model as a benchmark (see Section^. In 
this case, Hamiltonians Ho and /7 rn(Kl in Eq. ( p~6| > were gener¬ 
ated randomly from the GOE(A) ensemble of the unit variance 
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Table 1: Single-node performance: Execution times (in sec) as a function of system size N. Multi-threaded version of the code employs all 16 node’s cores on 
shared memory. Columns ii-iv and vi present data obtained for M = 10 2 time steps per period and L = 50 Chebyshev iterations on every step. To get an estimate 
for M = 10 4 , the time needed to calculate and perform Chebyshev iterations were extrapolated (last column), see text for more details. 


System size, 
N 

Auxiliary 
computations time 

Time of 

£2 calculation 

Chebyshev 
iterations time 

Diagonalization 

time 

Total time 
M = 10 2 

Total time 

M = 10 4 , extrapolation 

256 

0.2 

0.4 

4.3 

0.2 

5.1 

470.4 

512 

0.4 

1.8 

25.3 

0.6 

28.1 

2711.0 

768 

1.3 

3.2 

79.9 

1.4 

85.8 

8 312.7 

1 024 

1.8 

8.3 

177.3 

2.6 

190.0 

18 564.4 

1 536 

4.3 

18.9 

559.1 

7.3 

589.6 

57 811.6 

2 048 

8.6 

33.2 

1 296.6 

16.0 

1 354.4 

133 004.6 

3 072 

23.5 

72.1 

4 179.2 

51.0 

4 325.8 

425 204.5 

4 096 

49.2 

126.0 

9 730.5 

117.5 

10 023.2 

985 816.7 

5 120 

100.5 

184.3 

18 667.2 

242.1 

19 194.1 

1 885 492.6 

10 240 

755.3 

919.7 

181 722.3 

1 857.7 

185 254.9 

18 266 809.4 


<x = 1 l38l . The driving function is /(f) = cos(o;f) with u> = n 

ED- 

Single-node performance. To test a single-mode perfor¬ 
mance, we use L = 50 Chebyshev iterations on every step. The 
number of steps per period, M - 10 2 , was used for testing the 
program. The execution time for larger values of M can be eas¬ 
ily extrapolated: due to the linear increase of operations number 
with iterations in time, it is sufficient to find and appropriately 
scale execution time of the core part of the code, and add ex¬ 
ecution time of the other parts, which are independent of M. 
Table [T] presents the dependence of the execution time on the 
size of the model system. The last column of Table [T] presents 
estimates for the case when the number of steps is increased 
100-fold, i.e. for M = 10 4 . 

To gain a further insight, we analyze a single-node perfor¬ 
mance in some more detail. We consider two metrics, both nor¬ 
malized by the number of operations, OC v . required to make 
calculations for a system of a size N. Namely, we calculate the 
operation rate R that is the number of operations per unit time, 
and take the value obtained for N = 256 as a unit measure. The 
first metrics reads R # = (OC v /OC 2 5 c,) / (TIM E v /TIME 25 6), 
where TlMEy is the execution time of the code for the system 
of size N. Further, we consider a similar quantity, P N , where the 
number of operations is estimated by N 3 , according to the scal¬ 
ing of the most computationally intensive and most frequently 
called MKL subroutine zgemm. Table[2]presents /Ly and //■ as 
functions of N. 

The number of operations is calculated by Intel® VTune™ 
Amplifier profiler (40l and returned to CPU performance 
counter SIMD FP 256.PACKED DOUBLE gQ. This vari¬ 
able contains the number of issued Advanced Vector Extensions 
(AVX) instructions for processing double precision values (42]. 
The choice of this counter is based on the fact that almost all 
computations in our code occur in Intel MKL BLAS routines. 
Note, however, that this estimate of the number of operations 
is not exact. It is well known that for the current architectures 
the profiler tends to overestimate this number, since it counts 
the number of instructions issued but retired. Nevertheless, this 
estimate is reliable for CPU-bound processes. 

The behavior of R and P as functions of N are presented in 


Table 2: Single-node performance. Computational intensity and efficiency 
measures, Rn and Pn, as functions of the model system size N. Multi-threaded 
implementation of the algorithm on a single cluster node (16 cores on shared 
memory) was used. The number of steps per period is M = 10 2 with L = 50 
Chebyshev iterations on every step. 


System size, 
N 

Total time, 
TIMEy, in sec 

Operations count, 
OQv, in min 

Rn 

Pn 

256 

5.1 

173 612 

1.00 

1.00 

512 

28.1 

1 365 642 

1.43 

1.45 

768 

85.8 

4 594 916 

1.57 

1.60 

1 024 

190.0 

10 875 688 

1.68 

1.72 

1 536 

589.6 

36 655 208 

1.83 

1.87 

2 048 

1 354.4 

86 961 706 

1.89 

1.93 

3 072 

4 325.8 

292 683 840 

1.99 

2.04 

4 096 

10 023.2 

693 473 512 

2.03 

2.08 

5 120 

19 194.1 

1 354 053 950 

2.07 

2.13 


two last columns of Table [2] The efficiency increases with the 
size of the model system, doubling for N = 5 120 as compared 
to N = 256. That is because of the increasing efficiency in 
evaluation of larger matrices of the BLAS computational ker¬ 
nels in the parallel regime. While the efficiency grows with the 
system size, the execution time also increases, mainly because 
the increase of the number of steps per period needed, and for 
N — 5 120 the estimated calculation time is about 22 days. 
Therefore, a multi-node parallelization is required to decrease 
the calculation time to more realistic time scales. 

Strong scalability of the algorithm. We next analyze the per¬ 
formance of the algorithm on a cluster. To benchmark the code, 
we use the random-matrix model of the size N = 5 120 and 
launch the code on P = {1,2, ...,2/ ...,2 6 } cluster nodes, using 
the multi-threaded implementation on each node as before. The 
results are summarized in Table [3] Let us note that the time 
needed for the diagonalization of N — 5 120 matrix is 242.1 sec 
and does not depend on P (see column vi in Table[I]). Therefore, 
it is omitted from the further analysis. 

For M = 10 2 the code accelerates as the number of nodes 
increases to 64 (1 024 computational cores in total), though the 
efficiency of parallelization, defined as the ratio between the 
speed up and the number of nodes, drops to 35%. That is be- 
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Table 3: Scalability analysis: execution times (in sec ) of different steps of the algorithm and speed-up factors (in %) are shown as functions of the number of 
employed cluster nodes. Each node runs multi-threaded version on 16 cores on shared memory. The parameters are N = 5 120, L = 50. 


Number 
of nodes, 

P 

Auxiliary 

computations 

time 

Time of Q 

calculation 

Chebyshev 

iterations 

time 

Chebyshev 
iterations 
speed up 

Total 

time 

M = 10 2 

Speed up 
M = 10 2 

Total 

time 

M = 10 4 

Speed up 
M = 10 4 

1 

100.5 

184.3 

18 667.2 

1.0 

19 194.1 

1.0 

1 885 457.9 

1.0 

2 

186.4 

181.5 

9 406.9 

2.0 

10 017.8 

1.9 

959 150.8 

2.0 

4 

195.3 

180.2 

4 670.7 

4.0 

5 288.8 

3.6 

485 400.8 

3.9 

8 

173.6 

178.7 

2 341.2 

8.0 

2 935.8 

6.5 

252 305.2 

7.5 

16 

137.1 

178.4 

1 187.2 

15.7 

1 744.7 

11.0 

136 882.6 

13.8 

32 

103.9 

178.7 

628.0 

29.7 

1 152.6 

16.7 

81 006.1 

23.3 

64 

98.9 

178.6 

338.3 

55.2 

858.6 

22.4 

52 053.6 

36.2 



Figure 2: (color online) Computational efficiency as a function of number of 
cluster nodes P. Results are shown for two values of number of steps per period, 
M = 10 2 and M = 10 4 . The parameters are N = 5 120 and L = 50. 


cause for this, relatively small, number of integration steps the 
execution time of serial parts of the code becomes comparable 
to that of the parallel code (Chebyshev iterations), which scal¬ 
ing efficiency, in turn, is about 86%. Taking into account that 
practically reasonable computations require much larger num¬ 
bers of integration steps, it is expected that the efficiency of 
the code will increase with N. It is confirmed by the results 
of test runs, see last two columns of Table [3] In particular, for 
N = 5 120 and M = 10 4 the code is executed on 64 cluster 
nodes in 14.5 hours, demonstrating the efficiency of paralleliz¬ 
ing about 57%. Figure[2]shows the dependence of the efficiency 
on the number of employed cluster nodes. 

7. Applications 

In this section we test the accuracy of the algorithm by using 
two physical model systems which we described in Section [4] 
above. 

The random-matrix model was already specified in Section[6] 
For the dimer model, Eq. we use parameters v — 1 and 
U = U' ■ N = 2. The one-site interaction is scaled with the 
number of bosons, N - 1, to match in the limit N —> oo the 
classical mean-field Hamiltonian If36ll43ll . 

jjr 

H c \(z, v) = —z 2 - 2v^\ -z 2 cos(v) + 2z ■ fit). (19) 


The mean-field variables z and v measure the population imbal¬ 
ance and relative phase between the dimer sites, respectively. 
The driving function /(f) = + / ac cos (tot) consists of two 

components, a constant dc-bias /j c = 2.7 and single-harmonic 
term / ac cos(wf), with the amplitude / ac = 2.5 and frequency 
to — 3. We use the phase space of the mean-field system, 
Eq. ( [T9| l together with the semi-classical eigenfunction hypoth¬ 
esis 1441 and the concept of “hierarchical eigenstates’ ’ 031, for 
a “quantum” benchmarking of the program. 

Once the diagonalization of Uj is completed, the program 
initiates an additional round of T -propagation to calculate the 
average energies of the Floquet states with A(t) = H(t) in 
Eq. 07}. Finally, the Floquet states are sorted in ascending 
order according to their average energies. To quantify the accu¬ 
racy, we adapt the idea of overlap phase error Ii24l and modified 
it to account for the periodicity of the Floquet states, 

S #i = |U-<^(O)|£/ 7 -|0 #i (O)>|. (20) 

where /,(()) is the Floquet state calculated with the time step 
h = /j/2. Note that the error is state-specific. 

Figure[3]presents the error X ; , as a function of number of steps 
per period M and number of system states N. A linear depen¬ 
dence of log 10 2/ on log 10 M observed for the random-matrix 
model is typical for stepwise integrators. The error convergence 
with the number of steps does not saturate up to largest value 
M — 10 240. In the case of the dimer, however, the error does 
not reveal the power-law scaling and demonstrate a noticeable 
saturation. The difference in the scalings can be attributed to 
the differences in the spectral properties of the matrices Hq and 
// m od ; while in case of the random-matrix model the level spac- 
ings of the Hamiltonians are characterized by a probability den¬ 
sity function (pdf) with a gap near zero, the level spacings in the 
energy spectrum of the integrable dimer Hamiltonian are char¬ 
acterized by a gapless Poisson pdf 06). The Floquet ground- 
state | cj)\) turns to be the most sensitive to the discretization of 
the unitary evolution (see Fig. [3ja)). We use this state to test 
dependence of the error on the size of the model system. Fig¬ 
ure [3jb) shows that the scaling of X ; , with N is qualitatively 
similar for both models. 

Now we turn to the quantum benchmarking of the algorithm 
with the dimer model. Following the semi-classical eigenfunc¬ 
tion hypothesis 04II46 I. the Floquet states of the model in the 
limit N » 1 can be sorted according to the location of the 
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M, number of steps per period 



Figure 3: (color online) Phase error E^, Eq. <|20|, as a function of (a) number of steps per period M (N = 1025) and (b) size N of the model system, (a) The 
dependence E^ for the three Floquet states 10^): (□) groundstate, \i = 1; (V) the state from the center of the energy spectrum, fi = N/2\ (o) the highest-energy state, 
H = N. (b) The dependence Ei for the Floquet groundstate. 


state’s Husimi (or Wigner) distributions |[46l in the mean-field 
phase space. The latter is mixed in the case of the driven Hamil¬ 
tonian so that regular and chaotic regions coexists Ell 
(see dots on the Poincare sections on Fig. [4|. If the distribu¬ 
tion of a Floquet state is localized inside a regular region the 
state is labeled “regular”. When the distribution is located in 
the bulk of the chaotic sea, the corresponding Floquet states is 
“chaotic”. The regular regions, called “islands”, are often or¬ 
ganized in a hierarchical way, forming fine island-near-island 
structures. By increasing the number of bosons in the dimer, it 
is possible to resolve these structures with higher level of detail. 
It is important to understand, however, that the gradual motion 
towards the semi-classical limit does not simplify the quantum 
problem. On the contrary, the increase of N increases the size 
of the Hamiltonian matrices, the number of the basis vectors, 
and, what is most dramatic, decreases the mean spacing be¬ 
tween quasienergies e ;( (which for any N remain restricted to 
the fundamental stripe [-ha>/2, ha>/2]). Therefore, the accu¬ 
racy of calculations has to be increased in parallel, otherwise 
the numerically obtained Floquet states would represent inco¬ 
herent mixtures of actual Floquet states (while an error in the 
unitarity of the propagation may remain reasonable small). A 
theoretical interpretation of so obtained numerical results might 
lead to wrong conclusions. 

The accuracy of the scheme can be checked by using the 
chaotic - regular dichotomy. Figures 4 (e,f) show Husimi distri¬ 
butions for two regular Floquet states, while Fig. 4 (c) presents 
the distribution for a chaotic state. The accuracy can be tested 
even more carefully with a third class of quantum states, located 
on the interface between chaotic and regular ones. These hier¬ 
archical E31 states are supported by the classical phase-space 
structures located on the chaotic sea’s offshore around regular 
islands. The Husimi distribution of a hierarchical state is con¬ 
centrated in the immediate exterior of the corresponding island. 
Hierarchical states are exceptional in the sense that their abso¬ 
lute number increases sub-linearly with the number of states. 


Nhier ~ N x , x < 1> so that their relative fraction Nhi er /N goes to 
zero in the limit N — > oo ESI- These states must be carefully 
selected from the complete set of N Floquet states. 

Hierarchical states are coherent superpositions of many ba¬ 
sis vectors and therefore sensitive to the phase errors. Even a 
small mismatch in vector phases blurs the interference pattern 
and causes the flooding of the state’s Husimi distribution into 
the island. The high coherence of the superpositions is also 
a trait of the regular Floquet states but there is an important 
difference:Quasienergies e /( of the hierarchical states are ran¬ 
domly distributed over the interval [-hu>/2, ha>/2] while the 
quasienergies of regular and chaotic states tend to cluster in 
different regions ED- Because of that, phases of hierarchical 
states are most vulnerable to the error produced by the numeri¬ 
cal propagation. We selected several hierarchical states for the 
dimer model with N = 2 49(0bosons and inspect their Husimi 
distributions [? ], see Figs. 4 (a,b,d). The offshore localization 
and absence of the flooding [see zoomed distribution on Fig. 4 
(d)] are clearly visible. 

8. Summary and Outlook 

We have put forward a method to calculate Floquet states of 
periodically-modulated quantum system with N > 10 3 4 states. 
Our method is advantageous in that it is scalable and therefore 
well suited for its implementation on parallel computers. Our 
study uses massively parallel clusters as efficient devices to ex¬ 
plore complex quantum systems far from equilibrium, thus an¬ 
swering the need of several, actively developing, research fields 
involving quantum physics Ifl2l [T3l [T4l [151 [l6l [T71 [T8l [T9l l20l , 


3 We use the definition of the Husimi distribution for the dimer given in 

Refs. EH- The expression involves summation over the series of square roots 
of binomail coeffcients of the order N. We did not find an alternative expression 
which allows to avoid term-by-term summation. Although we calculated the 
Floquet states for the dimer with 10 239 bosons, we were not able to go beyond 
the limit N = 2 500 when calculating Husimi distributions. 
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Figure 4: (color online) Husimi distributions for hierarchical, chaotic and regular Floquet states of the dimer model with N = 2 499 bosons. Dots show the Poincare 
section for the mean-filed Hamiltonian, Eq. m- Circle-like formations correspond to the KAM tori ED, the solutions of the mean-field system inside regular 
islands. (a,b,d,) Hierarchical Floquet states of the quantum system: // = 118(a), n = 1 024 (b), fi = 101 (d, zoom), (c) Chaotic Floquet state, fi = 1002. (f,g) Two 
regular Floquet states: fi = 30 (f) and /i = 55 (g). 


The method particularly allows for improvements, such as 
the increase of the order of the Magnus expansion ||28|| or the 
use of commutator-free Magnus approximations (25l . With 
respect to further acceleration of the code for systems with 
N < 10 4 states, there is a promising perspective related to 
the fact that the main contribution to the computation time 
stems from the BLAS operations. These operations fit GPU 
and Intel Xeon Phi architectures very well. By our estimates, 
even a straightforward implementation of the most computa¬ 
tionally intensive Chebyshev iteration stage on a heterogeneous 
CPU+GPU configuration will result in a three-fold speedup. A 
yet further speed-up can be obtained by using multiple acceler¬ 
ators. 

There are several interesting research directions for which 
the proposed algorithm may serve as a useful starting point. 
For example, there is the perspective to resolve Floquet states 
of even larger systems by applying the spectral transformation 
Lanczos algorithm |49] to the corresponding time-independent 
super-Hamiltonians, to name but a few. Because the super- 
Hamiltonian elements can be generated on the fly, this idea po¬ 
tentially would allow to calculate Floquet states of a system 
with N ~ 10 5 states for F ~ 10 4 Fourier harmonics, by em¬ 
ploying massively parallel exact diagonalization schemes ED. 
Note, however, that the eigenvalues of the Hamiltonian super¬ 


matrix (as well as the respective quasienergies) are merely 
phase factors and are not directly related with the properties of 
the corresponding Floquet states (average energy, etc.). There¬ 
fore, some targeting of the algorithm to the low-energy states is 
required. Our method can be used to locate the relevant Floquet 
eigenvectors in the quasi-energy spectrum of a system with a 
smaller number of states; combining this with a knowledge of 
the spectrum scaling with N, one can target the Lanczos algo¬ 
rithm. 

Another direction relates to the computational physics of 
open periodically-modulated quantum systems that interact 
with a large environment (heat bath). Asymptotic states of such 
systems are affected by the combined effects of modulation and 
the decoherence induced by the environment ED. Due to lin¬ 
earity of the model equations describing the evolution of the 
density matrices of the systems, the corresponding asymptotic 
states are specified by time-periodic density matrices, which 
can be called “auntum attractors”. There is presently limited 
knowledge about the theme of quantum attractors beyond the 
limit of the rotating-wave approximation (52]. In the Floquet 
framework, the attractor’s density matrix is a zero-eigenvector 
of the corresponding non-unitary Floquet super-operator, which 
acts in the space of N x N Hermitian matrices. This super¬ 
operator can be constructed by propagating the identity opera- 
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tor - but now in the space of N x N matrices. The propagation 
stage can be realized by using Pade approximation It27l or with 
Newton or Faber polynomial schemes m. while the ques¬ 
tion whether there exists a possibility to generalize the Magnus 
expansion to dissipative quantum evolution equations remains 
open. As the number of the basis matrices scales as N 2 , the 
scalability of non-unitary propagation algorithms then presents 
an even more demanding task. 
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